Abstract

Symbolic dynamics has proven to be an invaluable tool in analyzing the mechanisms
that lead to unpredictability and random behavior in nonlinear dynamical
systems. Surprisingly, a discrete partition of continuous state space can produce
a coarse-grained description of the behavior that accurately describes the invariant
properties of an underlying chaotic attractor. In particular, measures of the
rate of information production (the topological and metric entropy rates) can be
estimated from the outputs of Markov or generating partitions. Here we develop
Bayesian inference for k-th order Markov chains as a method for finding generating
partitions and estimating entropy rates from finite samples of discretized data
produced by coarse-grained dynamical systems.



1 Introduction 



Research on chaotic dynamical systems during the last forty years produced a new vision of the 
origins of randomness. It is now widely understood that observed randomness can be generated by 
low-dimensional deterministic systems that exhibit a chaotic attractor. Today, when confronted with
what appears to be a high-dimensional stochastic process, one asks whether or not the process
is instead a hidden low-dimensional, but nonlinear, dynamical system. This awareness, though,
requires a new way of looking at apparently random data since chaotic dynamics are very sensitive
to the measurement process [1], which is both a blessing and a curse, as it turns out.

Symbolic dynamics, as one of a suite of tools in dynamical systems theory, in its most basic form
addresses this issue by considering a coarse-grained view of a continuous dynamics.^1 In this sense,
any finite-precision instrument that measures a chaotic system induces a symbolic representation of
the underlying continuous-valued behavior.



^1 For a recent overview consult [2] and for a review of current applications see [3] and references therein.



To effectively model time series of discrete data from a continuous-state system two concerns must 
be addressed. First, we must consider the measurement instrument and the representation of the true 
dynamics which it provides. Second, we must consider the inference of models based on this data. 
The relation between these steps is more subtle than one might expect. As we will demonstrate, on 
the one hand, in the measurement of chaotic data, the instrument should be designed to maximize 
the entropy rate of the resulting data stream. This allows one to extract as much information from 
each measurement as possible. On the other hand, model inference strives to minimize the apparent 
randomness (entropy rate) over a class of alternative models. This reflects a search for determinism 
and structure in the data. 

Here we address the interplay between optimal instruments and optimal models by analyzing a 
relatively simple nonlinear system. We consider the design of binary-output instruments for chaotic 
maps with additive noise. We then use Bayesian inference of a k-th order Markov chain to model the 
resulting data stream. Our model system is a one-dimensional chaotic map with additive noise [4, 5]

$$x_{t+1} = f(x_t) + \xi_t , \qquad (1)$$

where $t = 0, 1, 2, \ldots$, $x_t \in [0, 1]$, and $\xi_t \sim N(0, \sigma^2)$ is a Gaussian random variable with mean zero
and variance $\sigma^2$. To start we consider the design of instruments in the zero-noise limit. This is the
regime of most previous work in symbolic dynamics and provides a convenient frame of reference. 
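
As an illustration of Eq. (1), the following is a minimal sketch of how such a noisy orbit might be simulated, assuming the logistic map for $f$. The function names, the seed, the noise level and orbit length in the usage line, and the rule of redrawing any noise sample that would push the state outside $[0, 1]$ are all assumptions made for this example; the text does not specify how the boundary is handled.

```python
import random


def logistic(x, r=4.0):
    """Logistic map f(x) = r x (1 - x)."""
    return r * x * (1.0 - x)


def noisy_orbit(n_steps, sigma, r=4.0, n_transient=1000, seed=0):
    """Iterate x_{t+1} = f(x_t) + xi_t with xi_t ~ N(0, sigma^2).

    Assumption (not from the text): a noise draw that would take the
    state outside [0, 1] is rejected and redrawn.
    """
    rng = random.Random(seed)
    x = rng.random()  # random initial condition in the unit interval
    orbit = []
    for t in range(n_transient + n_steps):
        while True:
            x_next = logistic(x, r) + rng.gauss(0.0, sigma)
            if 0.0 <= x_next <= 1.0:
                break
        x = x_next
        if t >= n_transient:  # discard the transient steps
            orbit.append(x)
    return orbit


# Illustrative values only; sigma and the length are assumptions.
orbit = noisy_orbit(n_steps=10_000, sigma=1e-3)
```

Setting `sigma=0.0` recovers the zero-noise limit discussed next.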

The construction of a symbolic dynamics representation of a continuous-state system goes as follows
[2]. We assume time is discrete and consider a map $f$ from the state space $M$ to itself, $f : M \to M$.
This space can be partitioned into a finite set $\mathcal{P} = \{I_i : \cup_i I_i = M, I_i \cap I_j = \emptyset, i \neq j\}$ of
nonoverlapping regions in many ways. The most powerful is called a Markov partition and must
satisfy two conditions. First, the image of each region $I_i$ must be a union of intervals:
$f(I_i) = \cup_j I_j , \forall i$. Second, the map $f$, restricted to an interval $I_i$, must be one-to-one and onto. If a Markov
partition cannot be found for the system under consideration, the next best coarse-graining is called
a generating partition. For one-dimensional maps, these are often easily found using the extrema
of $f(x)$, its critical points. The critical points in the map are used to divide the state space into
intervals $I_i$ over which $f$ is monotone. Note that Markov partitions are generating, but the converse
is not generally true.

Given any partition $\mathcal{P} = \{I_i\}$, then, a series of continuous-valued states $X = x_0 x_1 \ldots x_{N-1}$ can
be projected onto its symbolic representation $S = s_0 s_1 \ldots s_{N-1}$. The latter is simply the associated
sequence of partition-element indices. This is done by defining an operator $\pi(x_t) = s_t$ that returns
a unique symbol $s_t = i$ for each $I_i$ from an alphabet $\mathcal{A}$ when $x_t \in I_i$.
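
The projection $\pi$ can be sketched in a few lines. This is only an illustration for a one-dimensional state space: the name `symbolize`, the `boundaries` argument (the interior cut points of the partition), and the convention that a point lying exactly on a cut belongs to the interval on its right are assumptions of the example.

```python
from bisect import bisect_right


def symbolize(xs, boundaries):
    """Return, for each state x_t, the index i of the interval I_i containing it.

    `boundaries` holds the interior cut points in increasing order, so a
    single cut d gives the binary partition {[0, d), [d, 1]}.
    """
    return [bisect_right(boundaries, x) for x in xs]


states = [0.10, 0.49, 0.50, 0.93]
symbols = symbolize(states, boundaries=[0.5])
print(symbols)  # [0, 0, 1, 1]
```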

The central result in symbolic dynamics establishes that, using a generating partition, increasingly 
long sequences of observed symbols identify smaller and smaller regions of the state space. Starting 
the system in such a region produces the associated measurement symbol sequence. In the limit 
of infinite symbol sequences, the result is a discrete-symbol representation of a continuous-state 
system — a representation that, as we will show, is often much easier to analyze. In this way a 
chosen partition creates a symbol sequence $\pi(X) = S$ which describes the continuous dynamics as
a sequence of symbols. The choice of partition then is equivalent to our instrument-design problem. 

The effectiveness of a partition (in the zero-noise limit) can be quantified by estimating the entropy
rate of the resulting symbolic sequence. To do this we consider length-$L$ words
$s^L = s_i s_{i+1} \ldots s_{i+L-1}$. The block entropy of length-$L$ sequences obtained from partition $\mathcal{P}$ is then

$$H_L(\mathcal{P}) = - \sum_{s^L \in \mathcal{A}^L} p(s^L) \log_2 p(s^L) , \qquad (2)$$

where $p(s^L)$ is the probability of observing the word $s^L \in \mathcal{A}^L$. From the block entropy the entropy
rate can be estimated as the following limit

$$h_\mu(\mathcal{P}) = \lim_{L \to \infty} \frac{H_L(\mathcal{P})}{L} . \qquad (3)$$

In practice it is often more accurate to calculate the length-$L$ estimate of the entropy rate using

$$h_{\mu L}(\mathcal{P}) = H_L(\mathcal{P}) - H_{L-1}(\mathcal{P}) . \qquad (4)$$
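
The block entropy and the difference estimate of the entropy rate can be sketched as follows. Word probabilities are replaced here by relative frequencies in a finite sample (a plug-in estimate), which is an assumption of this example and not the Bayesian estimator developed later; the function names are likewise hypothetical.

```python
from collections import Counter
from math import log2


def block_entropy(symbols, L):
    """Plug-in estimate of H_L from the relative frequencies of length-L words."""
    if L == 0:
        return 0.0
    words = Counter(tuple(symbols[i:i + L]) for i in range(len(symbols) - L + 1))
    total = sum(words.values())
    return -sum((c / total) * log2(c / total) for c in words.values())


def entropy_rate_estimate(symbols, L):
    """Length-L entropy-rate estimate H_L - H_{L-1}."""
    return block_entropy(symbols, L) - block_entropy(symbols, L - 1)


# A period-2 sequence has one bit of single-symbol entropy but
# (up to finite-sample effects) no entropy rate.
periodic = [0, 1] * 500
h1 = block_entropy(periodic, 1)
h_rate = entropy_rate_estimate(periodic, 2)
```

For the periodic sequence, `h1` is one bit while `h_rate` is close to zero, which shows why the difference form separates apparent randomness from structure.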



Another key result in symbolic dynamics says that the entropy of the original continuous system is
found using generating partitions [6, 7]. In particular, the true entropy rate maximizes the estimated
entropy rates:

$$h_\mu = \max_{\{\mathcal{P}\}} h_\mu(\mathcal{P}) . \qquad (5)$$



Thus, translated into a statement about experiment design, the results tell us to design an instrument
so that it maximizes the observed entropy rate. This reflects the fact that we want each measurement 
to produce the most information possible. 

As a benchmark, useful only when we know $f(x)$, Pesin's identity [8]
tells us that the value of $h_\mu$ is equal to the sum of the positive Lyapunov characteristic exponents:
$h_\mu = \sum \lambda^+$. For one-dimensional maps there is a single Lyapunov exponent, which is numerically
estimated from the map $f$ and observed trajectory $\{x_t\}$ using

$$\lambda = \lim_{N \to \infty} \frac{1}{N} \sum_{t=1}^{N} \log_2 |f'(x_t)| . \qquad (6)$$
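
A finite-$N$ version of this estimate can be sketched as below for the noise-free logistic map at $r = 4$. The function name, the initial condition, the transient length, and the number of steps are arbitrary choices for the example.

```python
from math import log2


def lyapunov_exponent(f, df, x0, n_steps, n_transient=1000):
    """Average of log2 |f'(x_t)| along an orbit, in bits per iteration."""
    x = x0
    for _ in range(n_transient):  # let the orbit settle onto the attractor
        x = f(x)
    total = 0.0
    for _ in range(n_steps):
        total += log2(abs(df(x)))
        x = f(x)
    return total / n_steps


r = 4.0
lam = lyapunov_exponent(
    f=lambda x: r * x * (1.0 - x),
    df=lambda x: r * (1.0 - 2.0 * x),
    x0=0.3,
    n_steps=100_000,
)
```

For this map the estimate should come out close to one bit per iteration, the benchmark against which the symbolic entropy rates are compared.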



Taken altogether, these results tell us how to design our instrument for effective observation of 
deterministic chaos. Notably, in the presence of noise no such theorems exist. However, Refs. [4, 5]
demonstrated that the methods developed above are robust in the presence of noise.

In any case, we view the output of the instrument as a stochastic process. A sample realization $D$
of length $N$ with measurements taken from a finite alphabet is the basis for our inference problem:
$D = s_0 s_1 \ldots s_{N-1}$, $s_t \in \mathcal{A}$. For our purposes here, the sample is generated by a partition of
continuous-state sequences from iterations of a one-dimensional map, and the states are on a chaotic
attractor. This means, in particular, that the stochastic process is stationary. We assume, in addition,
that the alphabet is binary, $\mathcal{A} = \{0, 1\}$.



2 Bayesian inference of k-th order Markov chains

Given a method for instrument design, the next step is to estimate a model from the observed
measurements. Here we choose to use the model class of k-th order Markov chains and Bayesian inference
as the model estimation and selection paradigm.

The k-th order Markov chain model class makes two strong assumptions about the data sample. The
first is an assumption of finite memory. In other words, the probability of $s_t$ depends only on the
previous $k$ symbols in the data sample. We introduce the more compact notation $\overleftarrow{s}^k_t = s_{t-k+1} \ldots s_t$
to indicate a length-$k$ sequence of measurements ending at time $t$. The finite memory assumption
is then equivalent to saying the probability of the observed data can be factored into the product of
terms with the form $p(s_t|\overleftarrow{s}^k_{t-1})$. The second assumption is stationarity. This means the probability of
observed sequences does not change with the time position in the data sample: $p(s_t|\overleftarrow{s}^k_{t-1}) = p(s|\overleftarrow{s}^k)$
for any index $t$. As noted above, this assumption is satisfied by the data streams produced. The
first assumption, however, is often not true of chaotic systems. They can generate time series with
infinitely long temporal correlations. Thus, in some cases, we may be confronted with out-of-class
modeling.

The k-th order Markov chain model class $M_k$ has a set of parameters
$\theta_k = \{p(s|\overleftarrow{s}^k) : s \in \mathcal{A}, \overleftarrow{s}^k \in \mathcal{A}^k\}$. In the Bayesian inference of the model parameters $\theta_k$ we must write down the
likelihood $P(D|\theta_k, M_k)$ and the prior $P(\theta_k|M_k)$ and then calculate the evidence $P(D|M_k)$. The
posterior distribution $P(\theta_k|D, M_k)$ is obtained from Bayes' theorem

$$P(\theta_k|D, M_k) = \frac{P(D|\theta_k, M_k) P(\theta_k|M_k)}{P(D|M_k)} . \qquad (7)$$

The posterior describes the distribution of model parameters $\theta_k$ given the model class $M_k$ and
observed data $D$. From this the expectation of the model parameters can be found along with
estimates of the uncertainty in the expectations. In the following sections we outline the specification
of these quantities following [9, 10].



2.1 Likelihood 

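Within the model class $M_k$, the likelihood of an observed data sample is given by

$$P(D|\theta_k, M_k) = \prod_{\overleftarrow{s}^k \in \mathcal{A}^k} \prod_{s \in \mathcal{A}} p(s|\overleftarrow{s}^k)^{n(\overleftarrow{s}^k s)} , \qquad (8)$$

where $n(\overleftarrow{s}^k s)$ is the number of times the word $\overleftarrow{s}^k s$ occurs in sample $D$. We note that Eq. (8) is
conditioned on the start sequence $\overleftarrow{s}^k = s_0 s_1 \ldots s_{k-1}$.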



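2.2 Prior

The prior is used to describe knowledge about the model class. In the case of the $M_k$ model class,
we choose a product of Dirichlet distributions, the so-called conjugate prior. Its form is

$$P(\theta_k|M_k) = \prod_{\overleftarrow{s}^k \in \mathcal{A}^k} \left\{ \frac{\Gamma(\alpha(\overleftarrow{s}^k))}{\prod_{s \in \mathcal{A}} \Gamma(\alpha(\overleftarrow{s}^k s))} \, \delta\Big(1 - \sum_{s \in \mathcal{A}} p(s|\overleftarrow{s}^k)\Big) \prod_{s \in \mathcal{A}} p(s|\overleftarrow{s}^k)^{\alpha(\overleftarrow{s}^k s) - 1} \right\} , \qquad (9)$$

where $\alpha(\overleftarrow{s}^k) = \sum_{s \in \mathcal{A}} \alpha(\overleftarrow{s}^k s)$ and $\Gamma(x)$ is the gamma function. The prior's parameters
$\{\alpha(\overleftarrow{s}^k s) : s \in \mathcal{A}, \overleftarrow{s}^k \in \mathcal{A}^k\}$ are assigned to reflect knowledge of the system at hand and must be
real and positive. An intuition for the meaning of the parameters can be obtained by considering the
mean of the Dirichlet prior, which is

$$E_{prior}[p(s|\overleftarrow{s}^k)] = \frac{\alpha(\overleftarrow{s}^k s)}{\alpha(\overleftarrow{s}^k)} . \qquad (10)$$

In practice, a common assignment is $\alpha(\overleftarrow{s}^k s) = 1$ for all parameters. This produces a uniform prior
over the model parameters, reflected by the expectation $E_{prior}[p(s|\overleftarrow{s}^k)] = 1/|\mathcal{A}|$. Unless otherwise
stated, all inference in the following uses the uniform prior.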

2.3 Evidence 

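The evidence can be seen as a simple normalization term in Bayes' theorem. However, when model
comparison of different orders and estimation of entropy rates is considered, this term becomes a
fundamental part of the analysis. The evidence is defined as

$$P(D|M_k) = \int d\theta_k \, P(D|\theta_k, M_k) P(\theta_k|M_k) . \qquad (11)$$

It gives the probability of the data $D$ given the model order $M_k$. For the likelihood and prior derived
above, the evidence is found analytically

$$P(D|M_k) = \prod_{\overleftarrow{s}^k \in \mathcal{A}^k} \left\{ \frac{\Gamma(\alpha(\overleftarrow{s}^k))}{\prod_{s \in \mathcal{A}} \Gamma(\alpha(\overleftarrow{s}^k s))} \, \frac{\prod_{s \in \mathcal{A}} \Gamma(n(\overleftarrow{s}^k s) + \alpha(\overleftarrow{s}^k s))}{\Gamma(n(\overleftarrow{s}^k) + \alpha(\overleftarrow{s}^k))} \right\} , \qquad (12)$$

where $n(\overleftarrow{s}^k) = \sum_{s \in \mathcal{A}} n(\overleftarrow{s}^k s)$.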
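
Because a Dirichlet prior is conjugate to the Markov-chain likelihood, the integral reduces to a ratio of gamma functions for each history. The sketch below evaluates its base-2 logarithm with log-gamma functions to avoid overflow. The function name, the restriction to a single prior parameter `alpha` shared by all words, and the choice to count transitions only after the first $k$ symbols are assumptions of the example.

```python
from collections import Counter
from itertools import product
from math import lgamma, log


def log2_evidence(symbols, k, alphabet=(0, 1), alpha=1.0):
    """log2 P(D | M_k) for a k-th order chain with a Dirichlet(alpha) prior.

    Per history h the contribution is
      Gamma(a_h) / prod_s Gamma(alpha) * prod_s Gamma(n_hs + alpha) / Gamma(n_h + a_h),
    with a_h = alpha * |A|, evaluated in log form.
    """
    counts = Counter(
        (tuple(symbols[t - k:t]), symbols[t]) for t in range(k, len(symbols))
    )
    a_hist = alpha * len(alphabet)
    total = 0.0
    for history in product(alphabet, repeat=k):
        n = [counts[(history, s)] for s in alphabet]
        total += lgamma(a_hist) - len(alphabet) * lgamma(alpha)
        total += sum(lgamma(n_s + alpha) for n_s in n) - lgamma(sum(n) + a_hist)
    return total / log(2.0)


# With k = 1 and a uniform prior, the sample 0101 has evidence 1/6.
value = log2_evidence([0, 1, 0, 1], k=1)
```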

2.4 Posterior 

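The posterior distribution is constructed from the elements derived above according to Bayes'
theorem, Eq. (7), resulting in a product of Dirichlet distributions. This form is a result of choosing the
conjugate prior and generates the familiar form

$$P(\theta_k|D, M_k) = \prod_{\overleftarrow{s}^k \in \mathcal{A}^k} \left\{ \frac{\Gamma(n(\overleftarrow{s}^k) + \alpha(\overleftarrow{s}^k))}{\prod_{s \in \mathcal{A}} \Gamma(n(\overleftarrow{s}^k s) + \alpha(\overleftarrow{s}^k s))} \times \delta\Big(1 - \sum_{s \in \mathcal{A}} p(s|\overleftarrow{s}^k)\Big) \prod_{s \in \mathcal{A}} p(s|\overleftarrow{s}^k)^{n(\overleftarrow{s}^k s) + \alpha(\overleftarrow{s}^k s) - 1} \right\} . \qquad (13)$$

Here $n(\overleftarrow{s}^k) = \sum_{s \in \mathcal{A}} n(\overleftarrow{s}^k s)$ counts the occurrences of each history. The mean for the model
parameters $\theta_k$ according to the posterior distribution is then

$$E_{post}[p(s|\overleftarrow{s}^k)] = \frac{n(\overleftarrow{s}^k s) + \alpha(\overleftarrow{s}^k s)}{n(\overleftarrow{s}^k) + \alpha(\overleftarrow{s}^k)} . \qquad (14)$$

Given these estimates of the model parameters $\theta_k$, the next step is to decide which order $k$ is best for a
given data sample.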
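
The posterior mean is a ratio of counts augmented by the prior parameters, which a short sketch makes concrete. The function names are hypothetical, and a single prior parameter `alpha` is assumed for every word (the uniform prior corresponds to `alpha=1`).

```python
from collections import Counter
from itertools import product


def count_words(symbols, k):
    """n(history, s): how often symbol s follows each length-k history."""
    counts = Counter()
    for t in range(k, len(symbols)):
        counts[(tuple(symbols[t - k:t]), symbols[t])] += 1
    return counts


def posterior_mean(symbols, k, alphabet=(0, 1), alpha=1.0):
    """Posterior-mean transition probabilities (n + alpha) / (n_history + alpha |A|)."""
    counts = count_words(symbols, k)
    mean = {}
    for history in product(alphabet, repeat=k):
        n_hist = sum(counts[(history, s)] for s in alphabet)
        for s in alphabet:
            mean[(history, s)] = (counts[(history, s)] + alpha) / (
                n_hist + alpha * len(alphabet)
            )
    return mean


data = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
theta = posterior_mean(data, k=1)
print(theta[((0,), 1)])  # 0.8
```

In this sample a 0 is always followed by a 1, yet the estimate is 0.8 instead of 1: the prior keeps unobserved transitions from being assigned zero probability.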



3 Model comparison of orders k 



Bayesian model comparison is very similar to the parameter estimation process discussed above.
We start by enumerating the set of model orders to consider, $\mathcal{M} = \{M_k : k \in [k_{\min}, k_{\max}]\}$. The
probability of a particular order can be found by considering two factorings of the joint distribution
$P(M_k, D|\mathcal{M})$. Solving for the probability of a particular order we obtain

$$P(M_k|D, \mathcal{M}) = \frac{P(D|M_k, \mathcal{M}) P(M_k|\mathcal{M})}{P(D|\mathcal{M})} , \qquad (15)$$

where the denominator is given by the sum $P(D|\mathcal{M}) = \sum_{M_{k'} \in \mathcal{M}} P(D|M_{k'}, \mathcal{M}) P(M_{k'}|\mathcal{M})$.
This expression is driven by two components: the evidence $P(D|M_k, \mathcal{M})$ derived above and the
prior over model orders $P(M_k|\mathcal{M})$. Two common priors are a uniform prior over orders and an
exponential penalty for the size of the model, $P(M_k|\mathcal{M}) \propto \exp(-|M_k|)$. For a k-th order Markov
chain the size of the model, or number of free parameters, is given by $|M_k| = |\mathcal{A}|^k (|\mathcal{A}| - 1)$. To
illustrate the method we will consider only the prior over orders $k$ with a penalty for model size.
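
Given the log evidence for each candidate order, the posterior over orders with the size penalty can be sketched as below. The function name and the input format (a dictionary from order $k$ to the natural-log evidence) are assumptions of the example, and the evidence values in the usage lines are made up purely to show the effect of the penalty.

```python
from math import exp


def order_posterior(log_evidence, alphabet_size=2):
    """Posterior probability of each order k under a prior proportional to exp(-|M_k|).

    `log_evidence` maps k to the natural log of P(D | M_k); the model size
    is |M_k| = |A|^k (|A| - 1).
    """
    log_post = {
        k: le - alphabet_size ** k * (alphabet_size - 1)
        for k, le in log_evidence.items()
    }
    shift = max(log_post.values())  # subtract the maximum for numerical stability
    weights = {k: exp(v - shift) for k, v in log_post.items()}
    norm = sum(weights.values())
    return {k: w / norm for k, w in weights.items()}


# Hypothetical input: two orders that explain the data equally well.
post = order_posterior({1: -10.0, 2: -10.0})
```

With equal evidence the smaller model wins: `post[1]` exceeds `post[2]` by the factor $e^{2}$ that separates the two model sizes.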



4 Estimating Entropy Rates 

The entropy rate of an inferred Markov chain can be estimated by extending the method for independent
identically distributed (IID) models of discrete data [11] using type theory [12]. In simple
terms, type theory shows that the probability of an observed sequence can be suggestively rewritten
in terms of the Kullback-Leibler (KL) distance and the entropy rate, Eq. (3). This form suggests a
connection to statistical mechanics and this, in turn, allows us to find average information-theoretic
quantities over the posterior by taking derivatives. In the large data limit, the KL distance vanishes
and we are left with the desired estimation of the Markov chain's entropy rate. The complete
development is beyond our scope here, but will appear elsewhere. However, we will provide a brief
sketch of the derivation and quote the resulting estimator.

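The connection we draw between inference and information theory starts by considering the product
of the prior, Eq. (9), and likelihood, Eq. (8): $P(\theta_k|M_k) P(D|\theta_k, M_k) = P(D, \theta_k|M_k)$. This product
forms a joint distribution over the observed data $D$ and model parameters $\theta_k$ given the model class
$M_k$. Writing the normalization constant from the prior as $Z$ to save space, this joint distribution
can be written, without approximation, in terms of conditional relative entropies $D[\cdot\|\cdot]$ and entropy
rates $h_\mu[\cdot]$

$$P(D, \theta_k|M_k) = Z \, 2^{-\beta_k (D[Q\|P] + h_\mu[Q])} \, 2^{+|\mathcal{A}|^{k+1} (D[U\|P] + h_\mu[U])} , \qquad (16)$$

where $\beta_k = \sum_{\overleftarrow{s}^k, s} n(\overleftarrow{s}^k s) + \alpha(\overleftarrow{s}^k s)$. The set of probabilities used above are

$$Q = \left\{ q(\overleftarrow{s}^k) = \frac{n(\overleftarrow{s}^k) + \alpha(\overleftarrow{s}^k)}{\beta_k} , \; q(s|\overleftarrow{s}^k) = \frac{n(\overleftarrow{s}^k s) + \alpha(\overleftarrow{s}^k s)}{n(\overleftarrow{s}^k) + \alpha(\overleftarrow{s}^k)} \right\} , \qquad (17)$$

$$U = \left\{ u(\overleftarrow{s}^k) = \frac{1}{|\mathcal{A}|^k} , \; u(s|\overleftarrow{s}^k) = \frac{1}{|\mathcal{A}|} \right\} , \qquad (18)$$

where $Q$ is the distribution defined by the posterior mean, $U$ is a uniform distribution, and
$P = \{p(\overleftarrow{s}^k), p(s|\overleftarrow{s}^k)\}$ are the "true" parameters given the model class. The information theory
quantities are given by

$$D[Q\|P] = \sum_{s, \overleftarrow{s}^k} q(\overleftarrow{s}^k) q(s|\overleftarrow{s}^k) \log_2 \frac{q(s|\overleftarrow{s}^k)}{p(s|\overleftarrow{s}^k)} , \qquad (19)$$

$$h_\mu[Q] = - \sum_{s, \overleftarrow{s}^k} q(\overleftarrow{s}^k) q(s|\overleftarrow{s}^k) \log_2 q(s|\overleftarrow{s}^k) . \qquad (20)$$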



The form of Eq. (16) and its relation to the evidence motivates the connection to statistical
mechanics. If we think of the evidence $P(D|M_k) = \int d\theta_k P(D, \theta_k|M_k)$ as a partition function
$Z = P(D|M_k)$, the free energy for the inference problem is simply $\mathcal{F} = -\log Z$. Using conventional
techniques from statistical mechanics, the expectation and variance of $D[Q\|P] + h_\mu[Q]$ are
obtained by taking derivatives of $\mathcal{F}$ with respect to $\beta_k$. In this sense $D[Q\|P] + h_\mu[Q]$ plays the
role of an internal energy and $\beta_k$ is comparable to an inverse temperature. We take advantage of the
known form for the evidence provided in Eq. (12) to calculate the desired expectation, resulting in

$$E_{post}[D[Q\|P] + h_\mu[Q]] = \frac{1}{\log 2} \sum_{\overleftarrow{s}^k} q(\overleftarrow{s}^k) \psi^{(0)}[\beta_k q(\overleftarrow{s}^k)] - \frac{1}{\log 2} \sum_{\overleftarrow{s}^k, s} q(\overleftarrow{s}^k s) \psi^{(0)}[\beta_k q(\overleftarrow{s}^k s)] , \qquad (21)$$

where $q(\overleftarrow{s}^k s) = q(\overleftarrow{s}^k) q(s|\overleftarrow{s}^k)$ and the polygamma function is defined as $\psi^{(n)}(x) = d^{n+1}/dx^{n+1} \log \Gamma(x)$. The meaning of the
terms on the RHS of Eq. (21) is not immediately clear. However, we can use an expansion of the
$n = 0$ polygamma function, $\psi^{(0)}(x) = \log x - 1/2x + O(x^{-2})$, which is valid for $x \gg 1$, to find
the asymptotic form

$$E_{post}[D[Q\|P] + h_\mu[Q]] = H_{k+1}[Q] - H_k[Q] + \frac{1}{2 \beta_k \log 2} |\mathcal{A}|^k (|\mathcal{A}| - 1) . \qquad (22)$$

From this expansion we can see that the first two terms make up the entropy rate
$h_{\mu k}[Q] = H_{k+1}[Q] - H_k[Q]$, and the last term must be associated with the conditional relative entropy
between the posterior mean estimate (PME) distribution $Q$ and the true distribution $P$.
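
A numerical sketch of this estimator is given below. It uses the identities $\beta_k q(\overleftarrow{s}^k s) = n(\overleftarrow{s}^k s) + \alpha(\overleftarrow{s}^k s)$ and $\beta_k q(\overleftarrow{s}^k) = n(\overleftarrow{s}^k) + \alpha(\overleftarrow{s}^k)$, so the expectation is a difference of two digamma-weighted sums divided by $\beta_k \log 2$. The function names, the hand-written digamma routine (recurrence plus an asymptotic series, used to keep the example dependency-free), and the single shared prior parameter `alpha` are assumptions of the example.

```python
from collections import Counter
from itertools import product
from math import log


def digamma(x):
    """psi^(0)(x) for x > 0: shift x upward by recurrence, then use an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x  # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + log(x) - 0.5 / x - series


def posterior_entropy_rate(symbols, k, alphabet=(0, 1), alpha=1.0):
    """Posterior expectation of D[Q||P] + h_mu[Q], in bits per symbol."""
    counts = Counter(
        (tuple(symbols[t - k:t]), symbols[t]) for t in range(k, len(symbols))
    )
    histories = list(product(alphabet, repeat=k))
    # b_word = n(history s) + alpha and b_hist = n(history) + alpha |A|
    b_word = {(h, s): counts[(h, s)] + alpha for h in histories for s in alphabet}
    b_hist = {h: sum(b_word[(h, s)] for s in alphabet) for h in histories}
    beta = sum(b_hist.values())
    total = sum(b * digamma(b) for b in b_hist.values())
    total -= sum(b * digamma(b) for b in b_word.values())
    return total / (beta * log(2.0))


# A period-2 sequence is fully predictable from one symbol of history.
rate_periodic = posterior_entropy_rate([0, 1] * 500, k=1)
```

For the periodic sample the estimate is small but not exactly zero, since the uniform prior still assigns some weight to the unobserved transitions.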

5 Experimental Setup 

Now that we have our instrument design and model inference methods fully specified we can 
describe the experimental setup used to test them. Data from simulations of the one-dimensional
logistic map, given by $f(x_t) = r x_t (1 - x_t)$, at the chaotic value of $r = 4.0$ was the basis
for the analysis. A fluctuation level of cr = 10"'^ was used for the added noise. A random initial 
condition in the unit interval was generated and one thousand transient steps, not analyzed, were 
generated to find a typical state on the chaotic attractor. Next, a single time series $x_0, x_1, \ldots, x_{N-1}$
of length N = 10* was produced. 

A family of binary partitions $\mathcal{P}(d) = \{$"0" $\sim x \in [0, d)$, "1" $\sim x \in [d, 1]\}$ of the continuous-valued
states was produced for two hundred decision points $d$ between 0 and 1. That is, values in
the state time series which satisfied $x_t < d$ were assigned symbol 0 and all others were assigned 1.
Given the symbolic representation of the data for a particular partition $\mathcal{P}(d)$, Markov chains from
order $k = 1$ to $k = 8$ were inferred and model comparison was used to select the order that most
effectively described the data. Then, using the selected model, values of entropy rate $h_\mu(d)$ versus
decision point $d$ were produced.
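
The sweep over decision points can be sketched in simplified form as follows. This is not the procedure used for the results: a plug-in block-entropy difference at a fixed word length stands in for the Bayesian order selection and entropy-rate estimator, only five decision points are scanned, and the noise level, orbit length, seed, and boundary handling (redrawing noise that leaves $[0, 1]$) are assumptions of the example.

```python
import random
from collections import Counter
from math import log2


def noisy_logistic_orbit(n, sigma, r=4.0, transient=1000, seed=0):
    """Noisy logistic orbit; noise draws that leave [0, 1] are redrawn (assumption)."""
    rng = random.Random(seed)
    x, orbit = rng.random(), []
    while len(orbit) < n + transient:
        x_next = r * x * (1.0 - x) + rng.gauss(0.0, sigma)
        if 0.0 <= x_next <= 1.0:
            x = x_next
            orbit.append(x)
    return orbit[transient:]  # drop the transient steps


def block_entropy(symbols, L):
    """Plug-in block entropy of length-L words, in bits."""
    words = Counter(tuple(symbols[i:i + L]) for i in range(len(symbols) - L + 1))
    total = sum(words.values())
    return -sum(c / total * log2(c / total) for c in words.values())


def plug_in_rate(symbols, L=4):
    """Stand-in entropy-rate estimate H_L - H_{L-1} at a fixed word length."""
    return block_entropy(symbols, L) - block_entropy(symbols, L - 1)


orbit = noisy_logistic_orbit(n=10_000, sigma=1e-3)
rates = {}
for d in (0.1, 0.3, 0.5, 0.7, 0.9):
    symbols = [0 if x < d else 1 for x in orbit]  # binary partition at d
    rates[d] = plug_in_rate(symbols)
best_d = max(rates, key=rates.get)
```

Even with this cruder estimator, the largest entropy rate among the scanned points should occur at `d = 0.5`.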

6 Results 

The results of our experiments are presented in Fig. 1. The bottom panel of Fig. 1(a) shows the
entropy rate $h_\mu(d)$ versus decision point estimated using Eq. (21). Note the nontrivial $d$ dependence
of $h_\mu(d)$. The dashed line shows an accurate numerical estimate of the Lyapunov exponent
using Eq. (6). It is also known to be $\lambda = 1$ bit per symbol from analytic results. We note that $h_\mu(d)$
is zero at the extremes of $d = 0$ and $d = 1$; the data stream there is all 1s or all 0s, respectively.
The entropy rate estimate reaches a maximum at $d = 1/2$. For this decision point the estimated
entropy rate is approximately equal to the Lyapunov exponent, indicating this instrument results in
a generating partition and satisfies Pesin's identity. In fact, this value of $d$ is also known to produce
a Markov partition.

The top panel of Fig. 1(a) shows the Markov chain order $k$ used to produce the entropy rate estimate
for each value of $d$. This dependence on $d$ is also complicated in ways one might not expect. The
order $k$ has two minima (ignoring $d = 0$ and $d = 1$): at $d = 1/2$ and $d = f^{-1}(1/2)$. These indicate
that the model size is minimized for those instruments. This is another indication of the Markov
partition for $r = 4.0$ and $d = 1/2$. These results confirm that the maximum entropy-rate instrument
is the most effective instrument for analysis of deterministic chaos in the presence of dynamical
noise. The model order is minimized at the generating partition.

Now let's consider the model order estimation process directly. The bottom panel of Fig. 1(b) shows
the estimated entropy rate $h_\mu(k)$ versus model order for four different decision points. A relative




[Figure 1, two panels: (a) Instrument design, horizontal axis $d$ (decision point); (b) Model selection, horizontal axis $k$ (order).]

Figure 1: Analysis of a single data stream of length N = 10'' from the logistic map at r = 4.0 with 
a noise level a = 10^'^. Two hundred evenly spaced decision points d E [0, 1] were used to define 
measurement partitions. 



minimum in the entropy rate for a given $d$ selects the model order. This reflects an optimization
for the most structure and smallest Markov chain representation of the data produced by a given 
instrument. The top panel in this figure shows the model probability versus k for the same set of 
decision points, illustrating exactly this point. The prior over model orders, which penalizes for 
model size, selects the Markov chain with lowest k and smallest entropy rate. 



7 Conclusion 

We analyzed the degree of randomness generated by deterministic chaotic systems with a small
amount of additive noise. Appealing to the well developed theory of symbolic dynamics, we demonstrated
that this required a two-step procedure: first, the careful design of a measuring instrument
and, second, effective model-order inference from the resulting data stream. The instrument should
be designed to be maximally informative and the model inference should produce the most compact
description in the model class. In carrying these steps out an apparent conflict appeared: in the first
step of instrument design, the entropy rate was maximized; in the second, it was minimized. Moreover,
it was seen that instrument design must precede model inference. In fact, performing the steps
in the reverse order leads to nonsensical results, such as using one or the other extreme decision
point, $d = 0$ or $d = 1$.

The lessons learned are very simply summarized: Use all of the data and nothing but the data. 
For deterministic chaos careful decision point analysis coupled with Bayesian inference and model 
comparison accomplishes both of these goals.